All Mixed Up? Finding the Optimal Feature Set for General Readability Prediction and Its Application to English and Dutch
Abstract
Readability research has a long and rich tradition, but there has been too little focus on general readability prediction without targeting a specific audience or text genre. Moreover, although NLP-inspired research has focused on adding more complex readability features, there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close detail the feasibility of constructing a readability prediction system for English and Dutch generic text using supervised machine learning. Based on readability assessments by both experts and crowdsourcing, we implement different types of text characteristics, ranging from easy-to-compute superficial text characteristics to features requiring deep linguistic processing, resulting in ten different feature groups. Both a regression and a classification set-up are investigated, reflecting the two possible readability prediction tasks: scoring individual texts or comparing two texts. We show that going beyond correlation calculations for readability optimization, using a wrapper-based genetic algorithm optimization approach, is a promising approach that provides considerable insight into which feature combinations contribute to the overall readability prediction. Because we also have gold-standard information available for those features requiring deep processing, we are able to investigate the true upper bound of our Dutch system. Interestingly, we observe that the performance of our fully automatic readability prediction pipeline is on par with the pipeline using gold-standard deep syntactic and semantic information.
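To make the idea of wrapper-based genetic algorithm optimization over feature groups more concrete, the following is a minimal sketch and not the authors' implementation. It searches over on/off masks for ten feature groups. The function `toy_evaluate`, the set of "useful" groups, and all hyperparameters (population size, mutation rate, number of generations) are hypothetical stand-ins; in a real wrapper set-up the evaluation function would train a learner on the selected groups and return its cross-validated performance.

```python
import random

N_GROUPS = 10  # the abstract mentions ten feature groups


def toy_evaluate(mask):
    """Hypothetical stand-in for cross-validated learner performance.

    Rewards three arbitrarily chosen groups and lightly penalizes the rest.
    A real wrapper would train and evaluate a model on the selected groups.
    """
    useful = {0, 3, 7}  # arbitrary choice, for illustration only
    hits = sum(1 for i in useful if mask[i])
    extras = sum(mask) - hits
    return hits - 0.1 * extras


def genetic_search(evaluate, n_bits=N_GROUPS, pop_size=20, generations=30,
                   mutation_rate=0.05, seed=0):
    """Search for a high-scoring bit mask with a simple genetic algorithm."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_bits))
                  for _ in range(pop_size)]
    best = max(population, key=evaluate)

    def pick():
        # Binary tournament selection: the fitter of two random individuals.
        a, b = rng.sample(population, 2)
        return a if evaluate(a) >= evaluate(b) else b

    for _ in range(generations):
        children = [best]  # elitism: carry the best mask over unchanged
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randint(1, n_bits - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = tuple((bit ^ 1) if rng.random() < mutation_rate else bit
                          for bit in child)
            children.append(child)
        population = children
        best = max(population, key=evaluate)
    return best, evaluate(best)


if __name__ == "__main__":
    mask, score = genetic_search(toy_evaluate)
    print("selected groups:", [i for i, bit in enumerate(mask) if bit])
    print("score:", score)
```

Because each candidate mask is scored by the evaluation function itself rather than by per-feature correlation with the target, the search can reward combinations of groups that only help jointly, which is the motivation the abstract gives for going beyond correlation calculations.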
Journal: Computational Linguistics
Volume: 42
Publication date: 2016